Week 6 Homework


Source: Python for Biologists

In this folder you’ll find a text file called data.csv, containing some made-up data for a number of genes. Each line contains the following fields for a single gene in this order: species name, sequence, gene name, expression level. The fields are separated by commas (hence the name of the file – csv stands for Comma Separated Values). Think of it as a representation of a table in a spreadsheet – each line is a row, and each field in a line is a column. All the exercises for this section use the data read from this file.
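As a rough sketch of the four-column layout described above, the snippet below parses two rows in that format from an in-memory string rather than from data.csv. The shortened sequences and the variable names (sample, parsed) are made up for illustration and are not part of the exercise files.

```python
import csv
import io

# Two illustrative rows in the same layout as data.csv:
# species name, sequence, gene name, expression level.
# The sequences are shortened here; they are not the real file contents.
sample = io.StringIO(
    "Drosophila melanogaster,atatcgcg,kdy647,264\n"
    "Drosophila yakuba,cgcgcgct,hdt739,85\n"
)

parsed = []
for species, sequence, gene_name, expression in csv.reader(sample):
    # csv.reader yields every field as a string, so the expression
    # level has to be converted before any numeric comparison.
    parsed.append((species, sequence, gene_name, int(expression)))

for record in parsed:
    print(record)
```

Unpacking each row into named variables is one way to avoid remembering that row[2] is the gene name; the solutions below use numeric indexes instead.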

Several species

Print out the gene names for all genes belonging to Drosophila melanogaster or Drosophila simulans.


In [ ]:
# %load data.csv
Drosophila melanogaster,atatatatatcgcgtatatatacgactatatgcattaattatagcatatcgatatatatatcgatattatatcgcattatacgcgcgtaattatatcgcgtaattacga,kdy647,264
Drosophila melanogaster,actgtgacgtgtactgtacgactatcgatacgtagtactgatcgctactgtaatgcatccatgctgacgtatctaagt,jdg766,185
Drosophila simulans,atcgatcatgtcgatcgatgatgcatccgactatcgtcgatcgtgatcgatcgatcgatcatcgatcgatgtcgatcatgtcgatatcgt,kdy533,485
Drosophila yakuba,cgcgcgctcgcgcatacggcctaatgcgcgcgctagcgatgc,hdt739,85
Drosophila ananassae,ttacgatcgatcgatcgatcgatcgtcgatcgtcgatgctacatcgatcatcatcggattagtcacatcgatcgatcatcgactgatcgtcgatcgtagatgctgacatcgatagca,hdu045,356
Drosophila ananassae,gcatcgatcgatcgcggcgcatcgatcgcgatcatcgatcatacgcgtcatatctatacgtcactgccgcgcgtatctacgcgatgactagctagact,teg436,222

In [6]:
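# First look at the csv module. Note that delimiter=' ' splits on spaces,
# not commas, so each line is broken in the middle of the species name.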

import csv
with open('data.csv') as csvfile:
    raw_data = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in raw_data:
        print(', '.join(row))


Drosophila, melanogaster,atatatatatcgcgtatatatacgactatatgcattaattatagcatatcgatatatatatcgatattatatcgcattatacgcgcgtaattatatcgcgtaattacga,kdy647,264
Drosophila, melanogaster,actgtgacgtgtactgtacgactatcgatacgtagtactgatcgctactgtaatgcatccatgctgacgtatctaagt,jdg766,185
Drosophila, simulans,atcgatcatgtcgatcgatgatgcatccgactatcgtcgatcgtgatcgatcgatcgatcatcgatcgatgtcgatcatgtcgatatcgt,kdy533,485
Drosophila, yakuba,cgcgcgctcgcgcatacggcctaatgcgcgcgctagcgatgc,hdt739,85
Drosophila, ananassae,ttacgatcgatcgatcgatcgatcgtcgatcgtcgatgctacatcgatcatcatcggattagtcacatcgatcgatcatcgactgatcgtcgatcgtagatgctgacatcgatagca,hdu045,356
Drosophila, ananassae,gcatcgatcgatcgcggcgcatcgatcgcgatcatcgatcatacgcgtcatatctatacgtcactgccgcgcgtatctacgcgatgactagctagact,teg436,222

In [5]:
# With the default delimiter (a comma), each row is a list of four strings

import csv
with open('data.csv') as csvfile:
    raw_data = csv.reader(csvfile)
    for row in raw_data:
        print(row)


['Drosophila melanogaster', 'atatatatatcgcgtatatatacgactatatgcattaattatagcatatcgatatatatatcgatattatatcgcattatacgcgcgtaattatatcgcgtaattacga', 'kdy647', '264']
['Drosophila melanogaster', 'actgtgacgtgtactgtacgactatcgatacgtagtactgatcgctactgtaatgcatccatgctgacgtatctaagt', 'jdg766', '185']
['Drosophila simulans', 'atcgatcatgtcgatcgatgatgcatccgactatcgtcgatcgtgatcgatcgatcgatcatcgatcgatgtcgatcatgtcgatatcgt', 'kdy533', '485']
['Drosophila yakuba', 'cgcgcgctcgcgcatacggcctaatgcgcgcgctagcgatgc', 'hdt739', '85']
['Drosophila ananassae', 'ttacgatcgatcgatcgatcgatcgtcgatcgtcgatgctacatcgatcatcatcggattagtcacatcgatcgatcatcgactgatcgtcgatcgtagatgctgacatcgatagca', 'hdu045', '356']
['Drosophila ananassae', 'gcatcgatcgatcgcggcgcatcgatcgcgatcatcgatcatacgcgtcatatctatacgtcactgccgcgcgtatctacgcgatgactagctagact', 'teg436', '222']

In [7]:
# Print the gene name (row[2]) when the species (row[0]) is one of the two we want

import csv
with open('data.csv') as csvfile:
    raw_data = csv.reader(csvfile)
    for row in raw_data:
        if row[0] == 'Drosophila melanogaster' or row[0] == 'Drosophila simulans':
            print(row[2])


kdy647
jdg766
kdy533

In [8]:
import csv
with open('data.csv') as csvfile:
    raw_data = csv.reader(csvfile)
    for row in raw_data:
        if row[0] in ['Drosophila melanogaster', 'Drosophila simulans']:
            print(row[2])


kdy647
jdg766
kdy533

Length range

Print out the gene names for all genes between 90 and 110 bases long.


In [11]:
import csv
with open('data.csv') as csvfile:
    raw_data = csv.reader(csvfile)
    for row in raw_data:
        # Both bounds must hold at once, so use "and" (written here as a chained comparison)
        if 90 <= len(row[1]) <= 110:
            print(row[2])


kdy647
kdy533
teg436
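A minimal sketch of the range test as a reusable function, assuming "between 90 and 110" is meant to include both end points. The function name in_length_range and its default bounds are illustrative choices, not something the exercise prescribes.

```python
def in_length_range(sequence, low=90, high=110):
    """Return True if the sequence length is between low and high, inclusive."""
    # A chained comparison is equivalent to
    # (low <= len(sequence)) and (len(sequence) <= high).
    return low <= len(sequence) <= high


# Check the behaviour just outside and exactly on the boundaries.
for n in (89, 90, 110, 111):
    print(n, in_length_range("a" * n))
```

If the range were meant to exclude the end points, the two `<=` operators would become `<`.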

AT content

Print out the gene names for all genes whose AT content is less than 0.5 and whose expression level is greater than 200.


In [15]:
def has_low_at_content(dna):
    """Return True if fewer than half of the bases in dna are A or T."""
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return at_content < 0.5

In [16]:
import csv
with open('data.csv') as csvfile:
    raw_data = csv.reader(csvfile)
    for row in raw_data:
        # row[1] is the sequence; row[3] is the expression level, read as a string
        if has_low_at_content(row[1]) and int(row[3]) > 200:
            print(row[2])


teg436

Complex condition

Print out the gene names for all genes whose name begins with “k” or “h” except those belonging to Drosophila melanogaster.


In [20]:
import csv
with open('data.csv') as csvfile:
    raw_data = csv.reader(csvfile)
    for row in raw_data:
        if (row[2].startswith('k') or row[2].startswith('h')) and row[0] != 'Drosophila melanogaster':
            print(row[2])


kdy533
hdt739
hdu045
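The same condition can be written a little more compactly, because str.startswith also accepts a tuple of prefixes. The sketch below wraps the test in a hypothetical helper, is_wanted, so it can be tried without reading the file.

```python
def is_wanted(gene_name, species):
    """True for gene names starting with 'k' or 'h', except in D. melanogaster."""
    # startswith returns True if the string begins with any prefix in the tuple.
    starts_with_k_or_h = gene_name.startswith(('k', 'h'))
    return starts_with_k_or_h and species != 'Drosophila melanogaster'


print(is_wanted('kdy647', 'Drosophila melanogaster'))
print(is_wanted('kdy533', 'Drosophila simulans'))
print(is_wanted('teg436', 'Drosophila ananassae'))
```

Giving the first half of the condition its own name also makes the precedence explicit, which is what the parentheses in the solution above are doing.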

High low medium

For each gene, print out a message giving the gene name and saying whether its AT content is high (greater than 0.65), low (less than 0.45) or medium (between 0.45 and 0.65).


In [21]:
def at_percentage(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return at_content

In [22]:
import csv
with open('data.csv') as csvfile:
    raw_data = csv.reader(csvfile)
    for row in raw_data:
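        at_percent = at_percentage(row[1])
        # Include the gene name (row[2]) in each message, as the exercise asks
        if at_percent > 0.65:
            print(row[2] + ': AT content is high')
        elif at_percent < 0.45:
            print(row[2] + ': AT content is low')
        else:
            print(row[2] + ': AT content is medium')


kdy647: AT content is high
jdg766: AT content is medium
kdy533: AT content is medium
hdt739: AT content is low
hdu045: AT content is medium
teg436: AT content is medium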
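One way to keep the three-way decision separate from the printing is to put it in a small function that returns a label. This is only a sketch: the name classify_at_content is hypothetical, and it assumes that values of exactly 0.45 or 0.65 count as medium, as the thresholds in the exercise text suggest.

```python
def classify_at_content(at_content):
    """Return 'high', 'low' or 'medium' for an AT content between 0 and 1."""
    if at_content > 0.65:
        return 'high'
    elif at_content < 0.45:
        return 'low'
    else:
        # Everything from 0.45 to 0.65 inclusive ends up here.
        return 'medium'


# Try a value in each band plus the two boundary values.
for value in (0.7, 0.3, 0.5, 0.45, 0.65):
    print(value, classify_at_content(value))
```

Because the function returns the label instead of printing it, the caller can build whatever message it likes, for example one that starts with the gene name.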
